In 2015, the United Nations (UN) approved the 2030 Agenda for Sustainable Development, encompassing the 17 Sustainable Development Goals (SDGs) to achieve a better and more sustainable future for the people and the planet. The SDGs address the global challenges, including those related to poverty, inequality, climate change, environmental degradation, peace and justice.
The objective of this report is to classify countries worldwide into homogeneous groups based on their scores on the Sustainable Development Goals. The aim is to understand the main disparities between countries and to identify the areas in which they struggle to achieve the goals, in relation to their socioeconomic and political structure. Accordingly, the report analyses whether groups of countries that show similar progress on the 17 SDGs also converge in terms of socioeconomic and political characteristics. After classifying countries into homogeneous groups, each cluster is examined by income level and socioeconomic factors to analyse to what extent the structure of an economy affects the achievement of the SDGs.
The 17 SDGs are listed below:
• SDG 1: No Poverty
• SDG 2: Zero Hunger
• SDG 3: Good Health and Well-being
• SDG 4: Quality Education
• SDG 5: Gender Equality
• SDG 6: Clean Water and Sanitation
• SDG 7: Affordable and Clean Energy
• SDG 8: Decent Work and Economic Growth
• SDG 9: Industry, Innovation and Infrastructure
• SDG 10: Reduced Inequalities
• SDG 11: Sustainable Cities and Communities
• SDG 12: Responsible Consumption and Production
• SDG 13: Climate Action
• SDG 14: Life Below Water
• SDG 15: Life on Land
• SDG 16: Peace, Justice and Strong Institutions
• SDG 17: Partnerships for the Goals
Libraries
library(tidyverse)
library(GGally)
library(factoextra)
library(countrycode)
library(rworldmap)
library(mice)
library(plotly)
library(dplyr)
library(readr)
library(readxl)
library(gplots)
library(cluster)
library(mclust)
library(ggplot2)
library(gridExtra)
library(ggpubr)
library(tidyr)
library(Hmisc)
library(RColorBrewer)
To perform the analysis, data for 152 countries are included in the
report. The SDG progress data, corresponding to the 17 goals, are
obtained from the Sustainable Development Report 2022 gathered
from the SDG Indicators Database. Additionally, some
socioeconomic and political indicators are obtained from The World
Bank to complement our analysis. In particular, information
regarding GDP per capita, Government Effectiveness and Income level is
considered in the report.
We begin by loading the information on the progress of the countries, including the 17 scores for each goal and the Overall Score per country. It should be noted that only data on the 17 SDGs will be considered for both the PCA and the clustering analysis, while the Overall Score will only be used during the descriptive analysis to provide an overview of the countries’ overall progress in terms of sustainable development.
report2022 <- read.csv("Sustainable_Development_Report_2022.csv")
report2022<- report2022 %>%
dplyr::select(Name, ID, Overall_Score, starts_with("Goal_") & ends_with("Score"))
head(report2022)
## Name ID Overall_Score Goal_1_Score Goal_2_Score Goal_3_Score Goal_4_Score
## 1 Finland FIN 86.50874 99.8215 64.17350 94.71350 98.05167
## 2 Denmark DNK 85.63330 99.6960 66.39612 95.42814 97.73400
## 3 Sweden SWE 85.18928 98.8810 63.41087 95.71892 99.87567
## 4 Norway NOR 82.34929 99.5130 60.37638 97.24957 97.60700
## 5 Austria AUT 82.31520 99.3415 73.70100 91.94857 98.24167
## 6 Germany DEU 82.17874 99.5335 72.57800 93.77714 97.29967
## Goal_5_Score Goal_6_Score Goal_7_Score Goal_8_Score Goal_9_Score
## 1 91.07725 93.6282 89.02200 87.66233 94.41333
## 2 86.80300 89.8198 88.13250 88.88167 96.40733
## 3 90.91900 95.0576 93.28975 83.86317 97.32617
## 4 90.39525 84.9070 96.76025 83.79250 91.33417
## 5 82.85667 92.3754 85.21000 84.04100 95.67200
## 6 80.49150 88.5496 76.59475 86.91250 93.43133
## Goal_10_Score Goal_11_Score Goal_12_Score Goal_13_Score Goal_14_Score
## 1 98.4375 92.04550 70.24829 60.22333 85.12733
## 2 98.3890 95.06775 54.80729 58.53600 71.32233
## 3 93.3540 92.01250 63.09129 60.23900 67.26467
## 4 99.8590 94.00575 50.77643 20.64167 73.90250
## 5 93.7895 93.01467 56.78743 55.30133 NA
## 6 89.1155 90.90825 59.36029 55.58100 67.64233
## Goal_15_Score Goal_16_Score Goal_17_Score
## 1 84.9884 94.1074 72.90750
## 2 92.8168 93.2551 82.27325
## 3 80.1226 86.5778 87.21375
## 4 73.7214 90.4535 94.64250
## 5 73.4606 91.2038 68.70000
## 6 79.1014 84.1893 81.97250
Data on GDP per capita and Government Effectiveness is also incorporated for each country for the year 2022. Including information about GDP can provide insights into the economic structure of different countries, allowing us to assess the economic development levels across nations. Furthermore, considering government effectiveness data is crucial for understanding how governance quality of countries may impact their ability to achieve sustainable development goals.
indicators <- read_excel("World_Development_Indicators.xlsx",
col_types = c("text", "text", "numeric", "numeric"))
indicators <- indicators %>%
dplyr::rename(Country_Name = `Country Name`, Country_Code = `Country Code`)
str(indicators)
## tibble [217 × 4] (S3: tbl_df/tbl/data.frame)
## $ Country_Name : chr [1:217] "Afghanistan" "Albania" "Algeria" "American Samoa" ...
## $ Country_Code : chr [1:217] "AFG" "ALB" "DZA" "ASM" ...
## $ GovernmentEffectiveness: num [1:217] -1.8796 0.0651 -0.5131 0.6679 1.4953 ...
## $ GDPpercapita : num [1:217] NA 6810 4343 19673 41993 ...
head(indicators)
## # A tibble: 6 × 4
## Country_Name Country_Code GovernmentEffectiveness GDPpercapita
## <chr> <chr> <dbl> <dbl>
## 1 Afghanistan AFG -1.88 NA
## 2 Albania ALB 0.0651 6810.
## 3 Algeria DZA -0.513 4343.
## 4 American Samoa ASM 0.668 19673.
## 5 Andorra AND 1.50 41993.
## 6 Angola AGO -1.04 3000.
Finally, we incorporate The World Bank’s country classification based on four income groups: low, lower-middle, upper-middle, and high income. This classification provides additional information to our analysis by categorizing countries according to their income levels, which can provide context about the economic structure and development of each nation.
class <- read_excel("class.xlsx")
class <- class %>%
dplyr::rename(Income_group = `Income group`)
Once we have obtained the datasets containing all the variables, we need to clean them and apply some transformations in order to prepare the data for the analysis.
An essential step is to address missing values in our dataset. We begin by examining countries with missing values in the ‘Overall Score’ variable, as it implies that the corresponding 17 SDG indicators are also missing due to the small size of these countries.
# analyse missing values
sapply(report2022, function(x) sum(is.na(x))*100/nrow(report2022))
## Name ID Overall_Score Goal_1_Score Goal_2_Score
## 0.00000 0.00000 19.68912 24.87047 19.68912
## Goal_3_Score Goal_4_Score Goal_5_Score Goal_6_Score Goal_7_Score
## 19.68912 20.20725 19.68912 19.68912 19.68912
## Goal_8_Score Goal_9_Score Goal_10_Score Goal_11_Score Goal_12_Score
## 19.68912 19.68912 27.97927 19.68912 19.68912
## Goal_13_Score Goal_14_Score Goal_15_Score Goal_16_Score Goal_17_Score
## 19.68912 40.41451 19.68912 19.68912 19.68912
na_counts <- colSums(is.na(report2022))
print(na_counts)
## Name ID Overall_Score Goal_1_Score Goal_2_Score
## 0 0 38 48 38
## Goal_3_Score Goal_4_Score Goal_5_Score Goal_6_Score Goal_7_Score
## 38 39 38 38 38
## Goal_8_Score Goal_9_Score Goal_10_Score Goal_11_Score Goal_12_Score
## 38 38 54 38 38
## Goal_13_Score Goal_14_Score Goal_15_Score Goal_16_Score Goal_17_Score
## 38 78 38 38 38
columns_with_na <- names(na_counts[na_counts > 0])
print(columns_with_na)
## [1] "Overall_Score" "Goal_1_Score" "Goal_2_Score" "Goal_3_Score"
## [5] "Goal_4_Score" "Goal_5_Score" "Goal_6_Score" "Goal_7_Score"
## [9] "Goal_8_Score" "Goal_9_Score" "Goal_10_Score" "Goal_11_Score"
## [13] "Goal_12_Score" "Goal_13_Score" "Goal_14_Score" "Goal_15_Score"
## [17] "Goal_16_Score" "Goal_17_Score"
# remove missing Overall Scores
report2022 <- report2022[!is.na(report2022$Overall_Score), , drop = FALSE]
na_counts <- colSums(is.na(report2022))
print(na_counts)
## Name ID Overall_Score Goal_1_Score Goal_2_Score
## 0 0 0 10 0
## Goal_3_Score Goal_4_Score Goal_5_Score Goal_6_Score Goal_7_Score
## 0 1 0 0 0
## Goal_8_Score Goal_9_Score Goal_10_Score Goal_11_Score Goal_12_Score
## 0 0 16 0 0
## Goal_13_Score Goal_14_Score Goal_15_Score Goal_16_Score Goal_17_Score
## 0 40 0 0 0
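The missing-value check above can be reproduced on a small example. The sketch below uses a made-up data frame (the values are illustrative, not taken from the report) to show the percentage of missing values per column and the effect of dropping rows without an Overall Score.

```r
# Toy data with made-up values; NA marks a missing score.
toy_scores <- data.frame(
  Overall_Score = c(80, NA, 60, 70),
  Goal_1_Score  = c(99, NA, NA, 75),
  Goal_2_Score  = c(64, NA, 58, 61)
)

# Share of missing values per column, in percent.
na_pct <- colMeans(is.na(toy_scores)) * 100
na_pct

# Drop rows without an Overall Score, then recount what is still missing.
toy_kept <- toy_scores[!is.na(toy_scores$Overall_Score), , drop = FALSE]
na_left <- colSums(is.na(toy_kept))
na_left
```

In this toy case one goal score is still missing after the filter, which mirrors the situation in the report where a few goals keep missing values that must be imputed.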
Additionally, we observe missing values for specific goals (Goals 1, 4, 10 and 14). In this case, we impute the missing scores automatically, so that each missing value is replaced with a predicted value.
m = 4
mice_mod <- mice(report2022, m=m, method='rf')
##
## iter imp variable
## 1 1 Goal_1_Score Goal_4_Score Goal_10_Score Goal_14_Score
## 1 2 Goal_1_Score Goal_4_Score Goal_10_Score Goal_14_Score
## 1 3 Goal_1_Score Goal_4_Score Goal_10_Score Goal_14_Score
## 1 4 Goal_1_Score Goal_4_Score Goal_10_Score Goal_14_Score
## 2 1 Goal_1_Score Goal_4_Score Goal_10_Score Goal_14_Score
## 2 2 Goal_1_Score Goal_4_Score Goal_10_Score Goal_14_Score
## 2 3 Goal_1_Score Goal_4_Score Goal_10_Score Goal_14_Score
## 2 4 Goal_1_Score Goal_4_Score Goal_10_Score Goal_14_Score
## 3 1 Goal_1_Score Goal_4_Score Goal_10_Score Goal_14_Score
## 3 2 Goal_1_Score Goal_4_Score Goal_10_Score Goal_14_Score
## 3 3 Goal_1_Score Goal_4_Score Goal_10_Score Goal_14_Score
## 3 4 Goal_1_Score Goal_4_Score Goal_10_Score Goal_14_Score
## 4 1 Goal_1_Score Goal_4_Score Goal_10_Score Goal_14_Score
## 4 2 Goal_1_Score Goal_4_Score Goal_10_Score Goal_14_Score
## 4 3 Goal_1_Score Goal_4_Score Goal_10_Score Goal_14_Score
## 4 4 Goal_1_Score Goal_4_Score Goal_10_Score Goal_14_Score
## 5 1 Goal_1_Score Goal_4_Score Goal_10_Score Goal_14_Score
## 5 2 Goal_1_Score Goal_4_Score Goal_10_Score Goal_14_Score
## 5 3 Goal_1_Score Goal_4_Score Goal_10_Score Goal_14_Score
## 5 4 Goal_1_Score Goal_4_Score Goal_10_Score Goal_14_Score
report2022 <- complete(mice_mod, action=m)
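Because multiple imputation is stochastic, the imputed scores change between runs unless a seed is fixed. The sketch below is a minimal illustration on made-up data, not the report's actual run: it assumes predictive mean matching (`"pmm"`) instead of `"rf"` to keep dependencies light, and the `seed` value is arbitrary.

```r
library(mice)

# Made-up numeric data with a few missing values in y.
set.seed(1)
toy <- data.frame(x = rnorm(40), z = rnorm(40))
toy$y <- 2 * toy$x + rnorm(40)
toy$y[c(3, 11, 27)] <- NA

# Fixing `seed` makes the imputations reproducible; printFlag = FALSE silences the log.
imp_a <- mice(toy, m = 4, method = "pmm", seed = 123, printFlag = FALSE)
imp_b <- mice(toy, m = 4, method = "pmm", seed = 123, printFlag = FALSE)

# complete(..., action = k) returns the k-th of the m imputed datasets.
filled_a <- mice::complete(imp_a, action = 4)
filled_b <- mice::complete(imp_b, action = 4)

# Compare the two runs and check that no missing values remain.
isTRUE(all.equal(filled_a, filled_b))
anyNA(filled_a)
```

Note that `action = m` keeps only the last of the `m` completed datasets; whether to pool across imputations instead is a separate design choice not covered here.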
Finally, we apply the same procedure and replace missing values of
GDP per capita and Government Effectiveness with predicted values.
m = 4
mice_mod <- mice(indicators, m=m, method='rf')
##
## iter imp variable
## 1 1 GovernmentEffectiveness GDPpercapita
## 1 2 GovernmentEffectiveness GDPpercapita
## 1 3 GovernmentEffectiveness GDPpercapita
## 1 4 GovernmentEffectiveness GDPpercapita
## 2 1 GovernmentEffectiveness GDPpercapita
## 2 2 GovernmentEffectiveness GDPpercapita
## 2 3 GovernmentEffectiveness GDPpercapita
## 2 4 GovernmentEffectiveness GDPpercapita
## 3 1 GovernmentEffectiveness GDPpercapita
## 3 2 GovernmentEffectiveness GDPpercapita
## 3 3 GovernmentEffectiveness GDPpercapita
## 3 4 GovernmentEffectiveness GDPpercapita
## 4 1 GovernmentEffectiveness GDPpercapita
## 4 2 GovernmentEffectiveness GDPpercapita
## 4 3 GovernmentEffectiveness GDPpercapita
## 4 4 GovernmentEffectiveness GDPpercapita
## 5 1 GovernmentEffectiveness GDPpercapita
## 5 2 GovernmentEffectiveness GDPpercapita
## 5 3 GovernmentEffectiveness GDPpercapita
## 5 4 GovernmentEffectiveness GDPpercapita
indicators <- complete(mice_mod, action=m)
summary(indicators)
## Country_Name Country_Code GovernmentEffectiveness GDPpercapita
## Length:217 Length:217 Min. :-2.38987 Min. : 259
## Class :character Class :character 1st Qu.:-0.74527 1st Qu.: 2255
## Mode :character Mode :character Median :-0.09709 Median : 6984
## Mean :-0.04102 Mean : 19095
## 3rd Qu.: 0.65083 3rd Qu.: 25057
## Max. : 2.14483 Max. :240862
After handling the missing values in the datasets containing our variables of interest (SDG scores, socioeconomic indicators, and country income group information), we merge the data into a single dataset.
merged_data <- indicators %>%
inner_join(class, by = c("Country_Code"="Code"))
merged_data <- merged_data %>%
inner_join(report2022, by = c("Country_Name"="Name"))
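Joining on country names can silently drop countries whose names are spelled differently in the two sources. As a sketch on made-up name vectors (the spellings below are illustrative, not taken from the data), the unmatched entries can be listed before the join:

```r
# Made-up name vectors standing in for indicators$Country_Name and report2022$Name.
wb_names  <- c("Albania", "Turkiye", "Viet Nam", "Chile")
sdr_names <- c("Albania", "Turkey", "Vietnam", "Chile")

# Names present in one source but not the other would be lost by an inner join.
only_in_wb  <- setdiff(wb_names, sdr_names)
only_in_sdr <- setdiff(sdr_names, wb_names)

only_in_wb
only_in_sdr
```

Since both sources appear to carry ISO3 codes (`Country_Code` and `ID`), joining on the codes would be one way to avoid such mismatches.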
We begin by removing unnecessary columns to reduce the dimensions of our dataset and eliminate irrelevant information.
merged_data <- merged_data %>%
dplyr::select(-"Lending category", -"Economy", -"ID")
Next, we examine our data to ensure that the variables are in the correct format.
str(merged_data)
## 'data.frame': 152 obs. of 24 variables:
## $ Country_Name : chr "Afghanistan" "Albania" "Algeria" "Angola" ...
## $ Country_Code : chr "AFG" "ALB" "DZA" "AGO" ...
## $ GovernmentEffectiveness: num -1.8796 0.0651 -0.5131 -1.0404 -0.2829 ...
## $ GDPpercapita : num 650 6810 4343 3000 13651 ...
## $ Region : chr "South Asia" "Europe & Central Asia" "Middle East & North Africa" "Sub-Saharan Africa" ...
## $ Income_group : chr "Low income" "Upper middle income" "Lower middle income" "Lower middle income" ...
## $ Overall_Score : num 52.5 71.6 71.5 50.9 72.8 ...
## $ Goal_1_Score : num 11.8 94.3 97.4 12.9 96.6 ...
## $ Goal_2_Score : num 51.6 59.9 58.4 55.8 67.5 ...
## $ Goal_3_Score : num 38.1 82.9 76.1 34.8 79.2 ...
## $ Goal_4_Score : num 34.4 94.3 87.7 42.2 97.3 ...
## $ Goal_5_Score : num 21.7 53.2 53.4 50.3 81.2 ...
## $ Goal_6_Score : num 50.4 74.3 60.4 54.3 79.1 ...
## $ Goal_7_Score : num 44.1 81.3 65.3 63.6 72.4 ...
## $ Goal_8_Score : num 33.8 59.1 60.9 52.9 65.7 ...
## $ Goal_9_Score : num 7.44 31.11 46.58 11.25 48.45 ...
## $ Goal_10_Score : num 75.5 80.3 97 16.5 43.7 ...
## $ Goal_11_Score : num 29.3 74.5 57.8 47.6 82.1 ...
## $ Goal_12_Score : num 97.7 86.8 91.4 95.1 82.7 ...
## $ Goal_13_Score : num 98.8 88.5 88.6 96.8 88.2 ...
## $ Goal_14_Score : num 51.3 42.8 63.7 68.3 63.3 ...
## $ Goal_15_Score : num 52.9 80 69.9 66.5 61.2 ...
## $ Goal_16_Score : num 49.2 68.7 72.4 49 65.4 ...
## $ Goal_17_Score : num 42.9 65.7 69.3 48.3 63.2 ...
Since scores and country indicators should be treated as numeric, we verify their formatting accordingly. Additionally, we convert the variable ‘Income_group’ into a factor variable.
merged_data <- merged_data %>%
mutate(Income_group = fct_relevel(Income_group,"Low income","Lower middle income", "Upper middle income", "High income" ))
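The explicit level order matters because a factor created without it sorts its levels alphabetically, which would put "High income" first in plots and tables. A minimal base R sketch on made-up labels, as an assumed equivalent of the `fct_relevel()` call above:

```r
# Made-up income labels in arbitrary order.
income <- c("High income", "Low income", "Upper middle income",
            "Lower middle income", "High income")

# Without explicit levels, factor() sorts levels alphabetically.
levels(factor(income))

# Explicit levels give the economically meaningful order.
income_levels <- c("Low income", "Lower middle income",
                   "Upper middle income", "High income")
income_f <- factor(income, levels = income_levels)
levels(income_f)
table(income_f)
```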
Following these steps, we have obtained a dataset prepared for analysis containing 24 variables and 152 observations, each representing a country.
dim(merged_data)
## [1] 152 24
To gain a deeper insight into our data, we start by computing some summary statistics for our variables. The results displayed below show a maximum GDP per capita of about 125,006 and a minimum of 259, indicating that at least one country has an extremely high value and another a notably low one. The mean (about 17,700) is also well above the median (about 6,741). Additionally, the third quartile indicates that 75% of the countries have a GDP per capita equal to or lower than about 21,387. These measures suggest that GDP per capita varies widely across countries and that outliers are likely present.
data_summary <- merged_data %>%
summarise(
avg_GDPpercapita = mean(GDPpercapita),
median_GDPpercapita = median(GDPpercapita),
max_GDPpercapita = max(GDPpercapita),
min_GDPpercapita = min(GDPpercapita),
sd_GDPpercapita = sd(GDPpercapita),
Q1_GDPpercapita = quantile(GDPpercapita, probs = 0.25),
Q3_GDPpercapita = quantile(GDPpercapita, probs = 0.75),
avg_GovernmentEffectiveness = mean(GovernmentEffectiveness),
median_GovernmentEffectiveness = median(GovernmentEffectiveness),
max_GovernmentEffectiveness = max(GovernmentEffectiveness),
min_GovernmentEffectiveness = min(GovernmentEffectiveness),
sd_GovernmentEffectiveness = sd(GovernmentEffectiveness),
Q1_GovernmentEffectiveness = quantile(GovernmentEffectiveness, probs = 0.25),
Q3_GovernmentEffectiveness = quantile(GovernmentEffectiveness, probs = 0.75),
avg_Overall_Score = mean(Overall_Score),
median_Overall_Score = median(Overall_Score),
max_Overall_Score = max(Overall_Score),
min_Overall_Score = min(Overall_Score),
sd_Overall_Score = sd(Overall_Score),
Q1_Overall_Score = quantile(Overall_Score, probs = 0.25),
Q3_Overall_Score = quantile(Overall_Score, probs = 0.75)
) %>%
pivot_longer(
cols = everything(),
names_to = c(".value", "variable"),
names_sep = "_"
)
data_summary
## # A tibble: 3 × 8
## variable avg median max min sd Q1 Q3
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 GDPpercapita 1.77e+4 6741. 1.25e5 259. 2.44e+4 2147. 2.14e+4
## 2 GovernmentEffectiven… -4.59e-2 -0.133 2.14e0 -2.39 9.76e-1 -0.748 5.95e-1
## 3 Overall 6.73e+1 69.3 8.65e1 39.0 1.02e+1 60.1 7.46e+1
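The same kind of summary table can be built by applying one function to each column. The base R sketch below uses made-up values and a hypothetical helper `summarise_var()`. It also avoids splitting names such as `avg_Overall_Score` on underscores, which is likely why the table above shows `Overall` rather than `Overall_Score`.

```r
# Made-up values for three numeric variables.
toy <- data.frame(
  GDPpercapita = c(300, 2000, 7000, 20000, 120000),
  GovernmentEffectiveness = c(-2.0, -0.7, -0.1, 0.6, 2.1),
  Overall_Score = c(40, 60, 69, 75, 86)
)

# Hypothetical helper: the statistics used in the report, for one variable.
summarise_var <- function(x) {
  c(avg = mean(x), median = median(x), max = max(x), min = min(x),
    sd = sd(x),
    Q1 = unname(quantile(x, 0.25)),
    Q3 = unname(quantile(x, 0.75)))
}

# One row per variable, one column per statistic.
toy_summary <- t(sapply(toy, summarise_var))
toy_summary
```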
Based on the previous results, we apply a logarithmic transformation to the variable GDP per capita. This transformation is aimed at addressing outliers and the highly asymmetric distribution observed in the data. By taking the logarithm of GDP per capita values, we aim to normalize the distribution and reduce the impact of extreme values, which can distort statistical analyses and modeling techniques.
merged_data <- merged_data %>%
mutate(log_GDP = log(GDPpercapita)) %>%
relocate(log_GDP, .after = GDPpercapita)
merged_data %>%
dplyr::select(Country_Name, log_GDP, Overall_Score) %>%
summarise(max_GDPpercapita = max(log_GDP),
country_with_max_GDP = Country_Name[which.max(log_GDP)],
Overall_Score = Overall_Score[which.max(log_GDP)])
## max_GDPpercapita country_with_max_GDP Overall_Score
## 1 11.73612 Luxembourg 75.74422
merged_data %>%
dplyr::select(Country_Name, log_GDP, Overall_Score) %>%
summarise(min_GDPpercapita = min(log_GDP),
country_with_min_GDP = Country_Name[which.min(log_GDP)],
Overall_Score = Overall_Score[which.min(log_GDP)])
## min_GDPpercapita country_with_min_GDP Overall_Score
## 1 5.556925 Burundi 54.0531
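The effect of the logarithm on asymmetry can be quantified with the sample skewness. The sketch below uses simulated log-normal data as a stand-in for GDP per capita (an assumption for illustration only) and a small hypothetical `skewness()` helper.

```r
# Made-up right-skewed data: log-normal draws mimic a few very large values.
set.seed(42)
gdp <- exp(rnorm(500, mean = 9, sd = 1.3))

# Sample skewness: third standardised moment (0 for a symmetric distribution).
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

skew_raw <- skewness(gdp)
skew_log <- skewness(log(gdp))
c(raw = skew_raw, log = skew_log)
```

A strongly positive value for the raw variable and a value near zero after the transformation would be consistent with the histograms discussed in the report.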
There are 21 variables of interest in our report (17 SDG scores, Overall Score, GDP per capita, Government Effectiveness and Income Group). To get an idea of the main characteristics of the data, we focus first on the socioeconomic variables, which help us characterise the countries. We plot the distribution of GDP per capita, Government Effectiveness and Income Group: quantitative variables are displayed in histograms, while qualitative variables are represented in bar plots.
In the plot below, we can observe the right-skewed distribution of GDP per capita, confirming our previous conclusion that the variable is characterized by a highly asymmetric distribution. This skewness suggests that while the majority of countries have relatively low GDP per capita values, there are a few countries with extremely high values, leading to a long tail on the right side. To address this skewness, we apply a logarithmic scale to the GDP per capita values and plot the distribution, where we can observe a more symmetric and smooth distribution across countries.
Lastly, we illustrate the distribution of Government Effectiveness, which is an important indicator of the quality of governance within countries. The symmetric distribution suggests a relatively balanced distribution of governance quality across countries, with most nations falling within a similar range of effectiveness scores.
# histograms show the proportion of countries per bin
box_gdp <-
ggplot(merged_data, mapping = aes(x=GDPpercapita)) +
geom_histogram(bins=15,fill="#756bb1", aes(y=after_stat(count/sum(count))))
box_gdp_log <-
ggplot(merged_data, aes(x = log(GDPpercapita))) +
geom_histogram(bins = 15, fill = "#bcbddc",
aes(y = after_stat(count / sum(count)))) +
labs(x = "log(GDP per capita)", y = "Proportion", title = "Histogram of log(GDP per capita)")
box_govn <-
ggplot(merged_data, mapping=aes(x=GovernmentEffectiveness))+
geom_histogram(bins=15,fill="#c994c7",aes(y=after_stat(count/sum(count))))
ggarrange(box_gdp, box_gdp_log, box_govn,
ncol = 3, nrow = 1)
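The histograms above plot each bin's count divided by the total count, that is, the proportion of countries per bin rather than a density. A minimal base R sketch on made-up data shows the difference between the two quantities:

```r
# Made-up values standing in for a numeric variable.
set.seed(7)
x <- rnorm(152)

# Bin the data without plotting.
h <- hist(x, breaks = 15, plot = FALSE)

# Proportion of observations per bin: these sum to 1.
prop <- h$counts / sum(h$counts)

# Density per bin: proportion divided by bin width, so the bar areas sum to 1.
width <- diff(h$breaks)
sum(prop)
sum(h$density * width)
```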
The bar plot below indicates that our data contain a relatively high number of observations classified as high-income countries, while the number of low-income countries is relatively low. The two middle-income categories are quite balanced in our data.
color_palette <- brewer.pal(n = length(unique(merged_data$Income_group)), name = "Set2")
box_group <- merged_data %>% ggplot(aes(x = reorder(Income_group, Income_group, length))) +
geom_bar(aes(fill = Income_group)) +
labs(caption = "Countries per Income group",
x = "", y = "") +
theme(legend.position = "none") +
scale_fill_manual(values = color_palette)
box_group
In the following bar plot we observe the number of observations located in each of the 7 world regions. Europe and Central Asia is the region containing the most countries, while North America and South Asia contain very few countries.
color_palette <- brewer.pal(n = length(unique(merged_data$Region)), name = "Set2")
region_box <- merged_data %>% ggplot(aes(x=reorder(Region, Region, length))) +
geom_bar(aes(fill=Region)) +
labs(caption="Countries per Region",
x = "", y = "")+
theme(legend.position="none") +
scale_fill_manual(values = color_palette)
region_box
In the map below, a clear pattern emerges: more developed countries, particularly those in Europe, North America, and Australia, exhibit higher levels of government quality. In contrast, less developed and less wealthy countries, such as those in Latin America or Africa, demonstrate lower government effectiveness.
map = merged_data %>% dplyr::select(Country_Name, GovernmentEffectiveness)
map$country = countrycode(map$Country_Name, 'country.name', 'iso3c')
matched <- joinCountryData2Map(map, joinCode = "ISO3",
nameJoinColumn = "country")
## 152 codes from your data successfully matched countries in the map
## 0 codes from your data failed to match with a country code in the map
## 91 codes from the map weren't represented in your data
mapCountryData(matched,nameColumnToPlot="GovernmentEffectiveness",missingCountryCol = "white",
borderCol = "#C7D9FF",
catMethod = "pretty", colourPalette = "topo",
mapTitle = c("Government Effectiveness by Country"), lwd=1)
In relation to GDP, a clear distinction between regions in terms of wealth is evident. In this context, wealthier countries are primarily located in Europe, North America, Australia, and New Zealand, followed by Asia and South America. Conversely, poorer countries, as expected, are concentrated in developing regions, such as Africa and parts of Asia.
map = merged_data %>% dplyr::select(Country_Name, log_GDP)
map$country = countrycode(map$Country_Name, 'country.name', 'iso3c')
matched <- joinCountryData2Map(map, joinCode = "ISO3",
nameJoinColumn = "country")
## 152 codes from your data successfully matched countries in the map
## 0 codes from your data failed to match with a country code in the map
## 91 codes from the map weren't represented in your data
mapCountryData(matched,nameColumnToPlot="log_GDP",missingCountryCol = "white",
borderCol = "#C7D9FF",
catMethod = "pretty", colourPalette = "topo",
mapTitle = c("log(GDP per capita) by Country"), lwd=1)
In order to visualize the distribution of the scores on the Sustainable Development Goals, we create histograms for each of the 17 scores and the Overall Score, allowing us to explore the performance of the countries.
The distribution of the overall score across countries suggests two patterns. Most of the distribution is roughly symmetric and bell-shaped, while on the left side we can notice a small group of countries characterized by very low overall scores.
plot_score <-
ggplot(merged_data,mapping=aes(x=Overall_Score))+
geom_histogram(bins=15,fill="#7fcdbb",aes(y=after_stat(count/sum(count))))
print(plot_score)
Looking at the distributions of each of the 17 Sustainable Development Goal scores, we notice that, on the one hand, Goals 2, 3, 5, 6, 8, 9, 10, 11, 14, 15, 16 and 17 show a roughly symmetric, bell-shaped distribution. On the other hand, Goals 1, 4, 7, 12 and 13 show a left-skewed distribution, implying that most countries score high while a few have extremely low values. These findings point to varying levels of achievement across the different goals, highlighting areas where further effort may be needed to address disparities and improve overall global progress.
# build one histogram per goal instead of repeating the same call 17 times
goal_vars <- paste0("Goal_", 1:17, "_Score")
goal_plots <- lapply(goal_vars, function(v) {
ggplot(merged_data, aes(x = .data[[v]])) +
geom_histogram(bins = 15, fill = "#7fcdbb", aes(y = after_stat(count / sum(count)))) +
labs(x = v)
})
# arrange the plots in the same three grids as before
ggarrange(plotlist = goal_plots[1:9], ncol = 3, nrow = 3)
ggarrange(plotlist = goal_plots[10:13], ncol = 2, nrow = 2)
ggarrange(plotlist = goal_plots[14:17], ncol = 2, nrow = 2)
In order to analyse the behaviour of our variables in relation to the others, a bivariate analysis is conducted, where we can observe how the variables interact.
The conditional distribution of pairs of variables can be represented in order to visualize the dispersion of the variables and observe the behaviour of one variable in terms of the other. By doing so, we can explore whether government quality or wealth levels are determinants of sustainable development.
In the following interactive plot we can easily observe the levels of the overall SDG score in relation to government effectiveness. The results show an increasing pattern, where higher levels of sustainability progress are achieved when government quality increases.
p = ggplot(merged_data, aes(x=GovernmentEffectiveness, y=Overall_Score, group=Region, size=Overall_Score, color=GDPpercapita, text=Country_Name)) + geom_point(alpha=0.9) +
geom_point(alpha = 0.9, size = 3) +
facet_wrap(~ Region) +
scale_color_gradient(low="lightblue", high="darkblue") +
theme_minimal()+ theme(legend.position="none") +
labs(title = "World countries: Overall Score vs Government Effectiveness", subtitle="(color denotes GDP)",
x = "GovernmentEffectiveness", y = "Overall Score")
ggplotly(p, tooltip=c("Country_Name"))
Once again, the plot reveals a positive association between wealth and the achievement of the sustainability goals, as shown by the increasing overall SDG score with rising GDP per capita.
p=ggplot(merged_data, aes(x=log_GDP, y=Overall_Score, group=Region, size=Overall_Score, color=GovernmentEffectiveness, text=Country_Name)) + geom_point(alpha=0.9) +
geom_point(alpha = 0.9, size = 3) +
facet_wrap(~ Region) +
scale_color_gradient(low="lightblue", high="darkblue") +
theme_minimal()+ theme(legend.position="none") +
labs(title = "World countries: Overall Score vs GDP per capita", subtitle="(color denotes Government Effectiveness)",
x = "log(GDP per capita)", y = "Overall Score")
ggplotly(p, tooltip=c("Country_Name"))
As expected, the conditional box plot for income level and overall score shows that countries with higher income levels obtain higher SDG scores, while the lowest scores correspond to countries classified as low-income. These results suggest that as the income level increases, the scores for sustainable development also rise.
# Income level - Overall Score
conditional_bx_1 <- ggplot(merged_data, aes(x = Income_group, y = Overall_Score, fill = Income_group)) +
geom_boxplot() +
scale_fill_manual(values = brewer.pal(length(unique(merged_data$Income_group)), name = "Set2")) +
theme(legend.position="none")
conditional_bx_1
Regarding income level and GDP per capita, we observe the same pattern as before: higher income levels are associated with greater GDP per capita, and vice versa.
# Income level - GDP
conditional_bx_2 <- ggplot(merged_data, aes(x=Income_group, y=log(GDPpercapita), fill = Income_group)) +
geom_boxplot() +
scale_fill_manual(values = brewer.pal(length(unique(merged_data$Income_group)), name = "Set2")) +
theme(legend.position="none")
conditional_bx_2
Lastly, the conditional distribution of government effectiveness by income level reveals that higher levels of government quality are achieved in countries with higher income levels, while weaker governance is associated with low-income countries. This suggests that the level of government effectiveness increases with the income level.
# Income level - Government
conditional_bx_3 <- ggplot(merged_data, aes(x=Income_group, y=GovernmentEffectiveness, fill = Income_group)) +
geom_boxplot() +
scale_fill_manual(values = brewer.pal(length(unique(merged_data$Income_group)), name = "Set2")) +
theme(legend.position="none")
conditional_bx_3
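The pattern read from the box plots can also be checked numerically by comparing group medians. The sketch below uses made-up scores (two hypothetical countries per income group) rather than the report's data:

```r
# Made-up scores for a handful of countries per income group.
income_levels <- c("Low income", "Lower middle income",
                   "Upper middle income", "High income")
toy <- data.frame(
  Income_group = factor(rep(income_levels, each = 2), levels = income_levels),
  Overall_Score = c(48, 52, 58, 64, 68, 72, 78, 84)
)

# Median score per income group, in the factor's level order.
group_median <- sapply(split(toy$Overall_Score, toy$Income_group), median)
group_median

# TRUE if the medians never decrease from one income group to the next.
is_monotone <- all(diff(group_median) >= 0)
is_monotone
```

The same call with `GovernmentEffectiveness` or `log_GDP` in place of `Overall_Score` would give the numeric counterpart of the other two box plots.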
Building upon the descriptive analyses discussed in the preceding sections, we can extend the study by conducting an associative analysis. This method enables us to assess the presence of consistent and stable linkages between the variables, that is, whether relationships exist among the variables under study.
Although the primary objective of this report is to examine whether there are homogeneous groups of countries in relation to the achievement of the Sustainable Development Goals, we are also interested in analysing the association between the socioeconomic factors and the goal scores, as well as identifying potential relationships among the 17 goals. At least a slight degree of association between the variables is needed in order to classify countries into homogeneous clusters.
Our objective is to assess the stability and significance of relationships among our variables (between socioeconomic variables and goals, as well as among the goals themselves). First, we are interested in identifying the presence of such relationships and, subsequently, in analysing their direction as well as their strength, if any. To do so, we compute the correlation matrix for the quantitative variables of our dataset.
data <- data.frame(merged_data[1:152, c(3,5,8,9,10,11,12,13,14,15,16,17,18)])
# correlation matrix
cor_matrix <- round(cor(data),2)
cor_matrix
## GovernmentEffectiveness log_GDP Overall_Score
## GovernmentEffectiveness 1.00 0.83 0.76
## log_GDP 0.83 1.00 0.79
## Overall_Score 0.76 0.79 1.00
## Goal_1_Score 0.63 0.77 0.86
## Goal_2_Score 0.58 0.50 0.67
## Goal_3_Score 0.77 0.84 0.92
## Goal_4_Score 0.68 0.73 0.86
## Goal_5_Score 0.61 0.60 0.67
## Goal_6_Score 0.65 0.70 0.86
## Goal_7_Score 0.53 0.62 0.78
## Goal_8_Score 0.75 0.70 0.75
## Goal_9_Score 0.87 0.87 0.83
## Goal_10_Score 0.36 0.37 0.49
## Goal_1_Score Goal_2_Score Goal_3_Score Goal_4_Score
## GovernmentEffectiveness 0.63 0.58 0.77 0.68
## log_GDP 0.77 0.50 0.84 0.73
## Overall_Score 0.86 0.67 0.92 0.86
## Goal_1_Score 1.00 0.52 0.87 0.81
## Goal_2_Score 0.52 1.00 0.59 0.61
## Goal_3_Score 0.87 0.59 1.00 0.85
## Goal_4_Score 0.81 0.61 0.85 1.00
## Goal_5_Score 0.47 0.48 0.60 0.63
## Goal_6_Score 0.71 0.56 0.79 0.70
## Goal_7_Score 0.75 0.46 0.74 0.68
## Goal_8_Score 0.57 0.60 0.70 0.63
## Goal_9_Score 0.73 0.60 0.83 0.72
## Goal_10_Score 0.46 0.28 0.46 0.29
## Goal_5_Score Goal_6_Score Goal_7_Score Goal_8_Score
## GovernmentEffectiveness 0.61 0.65 0.53 0.75
## log_GDP 0.60 0.70 0.62 0.70
## Overall_Score 0.67 0.86 0.78 0.75
## Goal_1_Score 0.47 0.71 0.75 0.57
## Goal_2_Score 0.48 0.56 0.46 0.60
## Goal_3_Score 0.60 0.79 0.74 0.70
## Goal_4_Score 0.63 0.70 0.68 0.63
## Goal_5_Score 1.00 0.61 0.48 0.60
## Goal_6_Score 0.61 1.00 0.64 0.67
## Goal_7_Score 0.48 0.64 1.00 0.46
## Goal_8_Score 0.60 0.67 0.46 1.00
## Goal_9_Score 0.60 0.74 0.56 0.72
## Goal_10_Score 0.13 0.38 0.22 0.36
## Goal_9_Score Goal_10_Score
## GovernmentEffectiveness 0.87 0.36
## log_GDP 0.87 0.37
## Overall_Score 0.83 0.49
## Goal_1_Score 0.73 0.46
## Goal_2_Score 0.60 0.28
## Goal_3_Score 0.83 0.46
## Goal_4_Score 0.72 0.29
## Goal_5_Score 0.60 0.13
## Goal_6_Score 0.74 0.38
## Goal_7_Score 0.56 0.22
## Goal_8_Score 0.72 0.36
## Goal_9_Score 1.00 0.43
## Goal_10_Score 0.43 1.00
# correlation matrix with p-values (rcorr() comes from the Hmisc package)
library(Hmisc)
p_matrix <- rcorr(as.matrix(data))
p_matrix
## GovernmentEffectiveness log_GDP Overall_Score
## GovernmentEffectiveness 1.00 0.83 0.76
## log_GDP 0.83 1.00 0.79
## Overall_Score 0.76 0.79 1.00
## Goal_1_Score 0.63 0.77 0.86
## Goal_2_Score 0.58 0.50 0.67
## Goal_3_Score 0.77 0.84 0.92
## Goal_4_Score 0.68 0.73 0.86
## Goal_5_Score 0.61 0.60 0.67
## Goal_6_Score 0.65 0.70 0.86
## Goal_7_Score 0.53 0.62 0.78
## Goal_8_Score 0.75 0.70 0.75
## Goal_9_Score 0.87 0.87 0.83
## Goal_10_Score 0.36 0.37 0.49
## Goal_1_Score Goal_2_Score Goal_3_Score Goal_4_Score
## GovernmentEffectiveness 0.63 0.58 0.77 0.68
## log_GDP 0.77 0.50 0.84 0.73
## Overall_Score 0.86 0.67 0.92 0.86
## Goal_1_Score 1.00 0.52 0.87 0.81
## Goal_2_Score 0.52 1.00 0.59 0.61
## Goal_3_Score 0.87 0.59 1.00 0.85
## Goal_4_Score 0.81 0.61 0.85 1.00
## Goal_5_Score 0.47 0.48 0.60 0.63
## Goal_6_Score 0.71 0.56 0.79 0.70
## Goal_7_Score 0.75 0.46 0.74 0.68
## Goal_8_Score 0.57 0.60 0.70 0.63
## Goal_9_Score 0.73 0.60 0.83 0.72
## Goal_10_Score 0.46 0.28 0.46 0.29
## Goal_5_Score Goal_6_Score Goal_7_Score Goal_8_Score
## GovernmentEffectiveness 0.61 0.65 0.53 0.75
## log_GDP 0.60 0.70 0.62 0.70
## Overall_Score 0.67 0.86 0.78 0.75
## Goal_1_Score 0.47 0.71 0.75 0.57
## Goal_2_Score 0.48 0.56 0.46 0.60
## Goal_3_Score 0.60 0.79 0.74 0.70
## Goal_4_Score 0.63 0.70 0.68 0.63
## Goal_5_Score 1.00 0.61 0.48 0.60
## Goal_6_Score 0.61 1.00 0.64 0.67
## Goal_7_Score 0.48 0.64 1.00 0.46
## Goal_8_Score 0.60 0.67 0.46 1.00
## Goal_9_Score 0.60 0.74 0.56 0.72
## Goal_10_Score 0.13 0.38 0.22 0.36
## Goal_9_Score Goal_10_Score
## GovernmentEffectiveness 0.87 0.36
## log_GDP 0.87 0.37
## Overall_Score 0.83 0.49
## Goal_1_Score 0.73 0.46
## Goal_2_Score 0.60 0.28
## Goal_3_Score 0.83 0.46
## Goal_4_Score 0.72 0.29
## Goal_5_Score 0.60 0.13
## Goal_6_Score 0.74 0.38
## Goal_7_Score 0.56 0.22
## Goal_8_Score 0.72 0.36
## Goal_9_Score 1.00 0.43
## Goal_10_Score 0.43 1.00
##
## n= 152
##
##
## P
## GovernmentEffectiveness log_GDP Overall_Score
## GovernmentEffectiveness 0.0000 0.0000
## log_GDP 0.0000 0.0000
## Overall_Score 0.0000 0.0000
## Goal_1_Score 0.0000 0.0000 0.0000
## Goal_2_Score 0.0000 0.0000 0.0000
## Goal_3_Score 0.0000 0.0000 0.0000
## Goal_4_Score 0.0000 0.0000 0.0000
## Goal_5_Score 0.0000 0.0000 0.0000
## Goal_6_Score 0.0000 0.0000 0.0000
## Goal_7_Score 0.0000 0.0000 0.0000
## Goal_8_Score 0.0000 0.0000 0.0000
## Goal_9_Score 0.0000 0.0000 0.0000
## Goal_10_Score 0.0000 0.0000 0.0000
## Goal_1_Score Goal_2_Score Goal_3_Score Goal_4_Score
## GovernmentEffectiveness 0.0000 0.0000 0.0000 0.0000
## log_GDP 0.0000 0.0000 0.0000 0.0000
## Overall_Score 0.0000 0.0000 0.0000 0.0000
## Goal_1_Score 0.0000 0.0000 0.0000
## Goal_2_Score 0.0000 0.0000 0.0000
## Goal_3_Score 0.0000 0.0000 0.0000
## Goal_4_Score 0.0000 0.0000 0.0000
## Goal_5_Score 0.0000 0.0000 0.0000 0.0000
## Goal_6_Score 0.0000 0.0000 0.0000 0.0000
## Goal_7_Score 0.0000 0.0000 0.0000 0.0000
## Goal_8_Score 0.0000 0.0000 0.0000 0.0000
## Goal_9_Score 0.0000 0.0000 0.0000 0.0000
## Goal_10_Score 0.0000 0.0005 0.0000 0.0002
## Goal_5_Score Goal_6_Score Goal_7_Score Goal_8_Score
## GovernmentEffectiveness 0.0000 0.0000 0.0000 0.0000
## log_GDP 0.0000 0.0000 0.0000 0.0000
## Overall_Score 0.0000 0.0000 0.0000 0.0000
## Goal_1_Score 0.0000 0.0000 0.0000 0.0000
## Goal_2_Score 0.0000 0.0000 0.0000 0.0000
## Goal_3_Score 0.0000 0.0000 0.0000 0.0000
## Goal_4_Score 0.0000 0.0000 0.0000 0.0000
## Goal_5_Score 0.0000 0.0000 0.0000
## Goal_6_Score 0.0000 0.0000 0.0000
## Goal_7_Score 0.0000 0.0000 0.0000
## Goal_8_Score 0.0000 0.0000 0.0000
## Goal_9_Score 0.0000 0.0000 0.0000 0.0000
## Goal_10_Score 0.1107 0.0000 0.0060 0.0000
## Goal_9_Score Goal_10_Score
## GovernmentEffectiveness 0.0000 0.0000
## log_GDP 0.0000 0.0000
## Overall_Score 0.0000 0.0000
## Goal_1_Score 0.0000 0.0000
## Goal_2_Score 0.0000 0.0005
## Goal_3_Score 0.0000 0.0000
## Goal_4_Score 0.0000 0.0002
## Goal_5_Score 0.0000 0.1107
## Goal_6_Score 0.0000 0.0000
## Goal_7_Score 0.0000 0.0060
## Goal_8_Score 0.0000 0.0000
## Goal_9_Score 0.0000
## Goal_10_Score 0.0000
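As far as we understand `rcorr`, each entry of the P block above is the p-value of a test of zero Pearson correlation for one pair of variables. The following minimal sketch reproduces that test for a single pair with base R's `cor.test`. It uses the built-in `mtcars` data as a stand-in for `merged_data`, and the two columns chosen are purely illustrative.

```r
# Stand-in data: mtcars ships with base R (it is not the report's dataset)
x <- mtcars$mpg
y <- mtcars$wt

# Pearson correlation and two-sided test of H0: correlation = 0
res <- cor.test(x, y, method = "pearson")
r <- unname(res$estimate)  # sample correlation
p <- res$p.value           # p-value of the t test

# The same p-value computed by hand from t = r * sqrt((n - 2) / (1 - r^2))
n <- length(x)
t_stat <- r * sqrt((n - 2) / (1 - r^2))
p_manual <- 2 * pt(-abs(t_stat), df = n - 2)

cat(sprintf("r = %.2f, p = %.2g, p (by hand) = %.2g\n", r, p, p_manual))
```

Read this way, a value printed as 0.0000 in the P block is a p-value below 0.00005 rather than exactly zero, and the only pair in the matrix that is not significant at the 5% level is Goal 5 and Goal 10 (p = 0.1107).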
The correlations can also be visualized in the following correlation plot, which reveals strong relationships between some of the variables. For instance, there is a high correlation (0.84) between SDG 3 (Good Health and Well-being) and the log of GDP per capita, implying that a country’s health and well-being is related to its level of GDP per capita. Furthermore, we observe a high correlation (0.85) between SDG 3 (Good Health and Well-being) and SDG 4 (Quality Education), suggesting that countries with good quality education also tend to have high levels of health and well-being. Additionally, the matrix reveals a strong relationship (0.74) between SDG 6 (Clean Water and Sanitation) and SDG 9 (Industry, Innovation and Infrastructure), indicating a link between access to clean water and sanitation and the development of industry and infrastructure. By contrast, SDG 10 (Reduced Inequalities) is only weakly correlated with the other variables, and its correlation with SDG 5 (0.13, p = 0.11) is not statistically significant at conventional levels.
library(Hmisc)
library(corrplot)
library(PerformanceAnalytics)
# Visualization of the data matrix
corrplot(cor_matrix, type = "upper", order = "hclust",
tl.col = "black", tl.srt = 45)
Additionally, we can represent the correlations between GDP per capita (in logs), Government Effectiveness and Overall Score, together with their distributions and pairwise scatter plots, in the matrix below. The three variables appear approximately normally distributed and are strongly correlated with one another (correlations between 0.76 and 0.83). Moreover, the asterisks indicate that these relationships are statistically significant.
# Matrix of scatter plots
my_data <- data[ c('log_GDP','Overall_Score','GovernmentEffectiveness')]
chart.Correlation(my_data, histogram=TRUE,pch="+")
In this section we focus on Principal Component Analysis (PCA), a technique that represents multivariate data with a smaller number of variables without significant loss of information, which makes it possible to find hidden relationships between variables.
Performing PCA can help us identify which SDGs and other economic and political variables are most influential in explaining the variability in our data. Subsequently, we can use these components to assess the progress of each country. Therefore, to perform the PCA, we extract the numeric variables from our original dataset and scale them, retaining only those corresponding to the 17 SDGs, GDP per capita (log transformation), and Government Effectiveness.
# Extract the numeric variables (standardisation is done inside prcomp)
data = merged_data %>% dplyr::select(-c(Country_Name, Country_Code, Region, Income_group, Overall_Score, GDPpercapita))
# PCA on the standardised variables
pca = prcomp(data, scale. = TRUE)
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 3.2709 1.22182 1.1342 0.95584 0.88062 0.83087 0.76201
## Proportion of Variance 0.5631 0.07857 0.0677 0.04809 0.04082 0.03633 0.03056
## Cumulative Proportion 0.5631 0.64168 0.7094 0.75746 0.79828 0.83461 0.86518
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.66613 0.63385 0.55701 0.53691 0.4988 0.46729 0.40001
## Proportion of Variance 0.02335 0.02115 0.01633 0.01517 0.0131 0.01149 0.00842
## Cumulative Proportion 0.88853 0.90968 0.92600 0.94118 0.9543 0.96577 0.97419
## PC15 PC16 PC17 PC18 PC19
## Standard deviation 0.37039 0.35524 0.29374 0.27668 0.25342
## Proportion of Variance 0.00722 0.00664 0.00454 0.00403 0.00338
## Cumulative Proportion 0.98141 0.98805 0.99259 0.99662 1.00000
Given the PCA results, we first focus on the standard deviation of each principal component (PC), which indicates the spread of the data along that component. As we can observe, PC1 has the highest standard deviation, suggesting that PC1 captures the most variation in the data, followed by PC2 and PC3.
In relation to the proportion of variance, which measures how much of the total variability is captured by each component, the results show that PC1 explains the highest proportion of variance, 56.3%, followed by PC2 and PC3, whose proportions are around 7.9% and 6.8%, respectively.
Additionally, we can visualize the previous findings in the following plot, where we observe that the first and second principal components capture 64% of the total variability. Furthermore, using the first three principal components together we can explain around 71% of the total variance.
Based on this analysis, the results suggest that the first few principal components are essential for capturing the variability in the data. Conversely, each of the remaining components explains less than 5% of the total variance.
fviz_screeplot(pca, addlabels = TRUE)
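The proportions reported by `summary(pca)` follow directly from the standard deviations: the variance of a component is its squared standard deviation, and its share is that variance divided by the sum over all components. The sketch below illustrates the calculation on the built-in `USArrests` data, which is only a stand-in for the report's dataset; `pca_demo` is a hypothetical name.

```r
# Stand-in data: USArrests ships with base R (it is not the report's dataset)
pca_demo <- prcomp(USArrests, scale. = TRUE)

# Variance of each component is the square of its standard deviation
var_explained <- pca_demo$sdev^2

# Share of the total variance and running total
prop_var <- var_explained / sum(var_explained)
cum_var <- cumsum(prop_var)

print(round(rbind(proportion = prop_var, cumulative = cum_var), 4))
```

Applied to the output above, where 19 standardised variables give a total variance of 19, PC1 yields 3.2709^2 / 19 ≈ 0.563, which matches the reported proportion.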
In this step we focus on analysing the First Principal Component. The following plot shows the loading of each variable on PC1: a positive value indicates that higher values of the variable are associated with higher values of PC1, and a negative value indicates that higher values of the variable are associated with lower values of PC1. The plot suggests that GDP, Goal 3, Goal 9 and Goal 16 are the variables that contribute the most to PC1. Conversely, the bars for Goal 14 and Goal 15 are close to the zero line, meaning that they have less influence on PC1 and, consequently, contribute less to explaining the variability it captures. Further, the bars corresponding to Goal 12 and Goal 13 are below zero, indicating a negative association with PC1.
barplot_pc1 <- barplot(pca$rotation[,1], las=2, col="lightblue")
Additionally, we can visualize the contribution of the variables to the First Principal Component in the chart below, where the variables located on the left are the ones that contribute the most. Accordingly, Goal 3, Goal 9 and GDP are the variables with the highest contributions to PC1 and are therefore the most important for explaining the variability captured by PC1. In contrast, Goal 14 and Goal 15, which are located on the right side, contribute less to PC1.
Moreover, the red dashed line, which indicates the expected average contribution of the variables, helps us identify which of them make a more significant contribution to the principal component. Variables whose contributions exceed this red line are considered more influential in explaining the variability captured by the principal component.
fviz_contrib(pca, choice = "var", axes = 1)
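As we understand `fviz_contrib`, the contribution of a variable to a single component is its squared loading expressed as a percentage, and the red dashed line sits at the value every variable would have if all contributed equally (100 divided by the number of variables). The sketch below reproduces that calculation in base R on the built-in `USArrests` data, used here only as a stand-in; the object names are hypothetical.

```r
# Stand-in data: USArrests ships with base R (it is not the report's dataset)
pca_demo <- prcomp(USArrests, scale. = TRUE)

# Loadings of the first component (a unit-length vector)
loadings_pc1 <- pca_demo$rotation[, 1]

# Contribution in percent: squared loadings sum to 1, so these sum to 100
contrib_pc1 <- 100 * loadings_pc1^2

# Expected contribution if every variable contributed equally
expected <- 100 / length(loadings_pc1)

# Variables above the reference line
above <- names(contrib_pc1)[contrib_pc1 > expected]

print(sort(round(contrib_pc1, 1), decreasing = TRUE))
cat("Reference line:", expected, "%\n")
cat("Above the line:", paste(above, collapse = ", "), "\n")
```

With the report's 19 variables, the same reasoning would place the reference line at 100 / 19 ≈ 5.3%.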
In this sense, the countries with higher PC1 scores can be interpreted as making more progress towards achieving the SDGs, as is the case of Norway, Denmark, Sweden, Austria, Finland, New Zealand, Netherlands, Switzerland, Germany, and Ireland. Conversely, countries with lower PC1 scores can be interpreted as struggling to achieve the SDGs, such as Central African Republic, South Sudan, Chad, Somalia, Afghanistan, Sudan, Liberia, Haiti, Madagascar and Niger.
Country = merged_data$Country_Name
Region = merged_data$Region
Income_group = merged_data$Income_group
low_progress <- Country[order(pca$x[,1])][1:5]
high_progress <- Country[order(pca$x[,1], decreasing=T)][1:5]
Next, we examine the Second Principal Component. In the following bar plot we can visualize the loading of each variable on the Second Principal Component, which helps us understand the patterns in the data captured by that component.
We can notice that PC2 is negatively associated with two of the variables, SDG 14 (Life Below Water) and SDG 15 (Life on Land), implying that higher values of sustainable use of the oceans and of terrestrial ecosystems are associated with lower values of PC2. The contribution of the remaining variables to PC2 is relatively low, given their proximity to the zero line.
barplot(pca$rotation[,2], las=2, col="darkblue")
Alternatively, we can visualize the contribution of each SDG to the Second Principal Component in the plot below. In line with our previous conclusions, the variables located on the left side and exceeding the red dashed line are SDG 14 (Life Below Water) and SDG 15 (Life on Land), implying that they are the ones that contribute the most to PC2.
fviz_contrib(pca, choice = "var", axes = 2)
To gain more insight into PC2, we rank the countries using this component and interpret their positions in relation to the two influential variables identified (Life Below Water and Life on Land). Because both variables load negatively on PC2, the countries with the lowest PC2 scores, such as Namibia and Cuba, are the ones that perform better on these two environmental SDGs, while the countries with the highest PC2 scores, such as Singapore or Bahrain, perform worse on them.
low_contribution <- Country[order(pca$x[,2])][1:5]
print(low_contribution )
## [1] "Namibia" "Cuba" "Suriname" "Finland" "Estonia"
high_contribution <- Country[order(pca$x[,2], decreasing=T)][1:5]
print(high_contribution)
## [1] "Singapore" "Mauritius" "Bahrain" "Israel" "Guyana"
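When reading such rankings, it helps to recall that a component score is a weighted sum of the standardised variables, with the loadings as weights. A variable with a negative loading therefore lowers the score as it increases, and the overall sign of a component is arbitrary. The sketch below checks both points on the built-in `USArrests` data, used only as a stand-in; the object names are hypothetical.

```r
# Stand-in data: USArrests ships with base R (it is not the report's dataset)
pca_demo <- prcomp(USArrests, scale. = TRUE)

# Scores are the standardised data multiplied by the loading matrix
z <- scale(USArrests)
scores_manual <- z %*% pca_demo$rotation
max_diff <- max(abs(scores_manual - pca_demo$x))

# Effect of one variable on the PC2 score of the first observation:
# raising it by one standard deviation shifts the score by its loading
w <- pca_demo$rotation[, 2]
k <- which.max(abs(w))            # variable with the largest absolute loading
z_shift <- z[1, ]
z_shift[k] <- z_shift[k] + 1
change <- sum(z_shift * w) - sum(z[1, ] * w)

cat("Largest difference from prcomp scores:", max_diff, "\n")
cat("Variable:", names(w)[k], " loading:", round(w[k], 3),
    " change in score:", round(change, 3), "\n")
```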
After analysing the First and Second Principal Components, which together explain 64% of the variability in our data, we plot the scores on both components for each country. Countries that appear close together on the plot have very similar scores in PC1 and PC2, suggesting that they have similar patterns in their data and, consequently, similar progress towards the Sustainable Development Goals. In addition, the color of the points indicates the region to which the countries belong, allowing us to identify potential similarities across countries belonging to the same region.
Accordingly, we observe similar patterns between European countries, since all of them are located together, implying that their PC1 and PC2 scores are quite close. The same applies to most of the Sub-Saharan Africa and Latin America & Caribbean regions, whose countries are plotted together on the left and in the middle of the chart, respectively.
Conversely, countries belonging to the remaining regions (East Asia & Pacific, Middle East & North Africa, North America and South Asia) do not show such similarities, as they are scattered across the chart. For instance, the United States, Japan, and Australia show patterns similar to those exhibited by European countries, despite belonging to other geographical regions. Likewise, the scores of Haiti and Pakistan resemble the patterns depicted by countries in Sub-Saharan Africa, despite being geographically distant.
data.frame(z1=pca$x[,1],z2=pca$x[,2]) %>%
ggplot(aes(z1,z2,label = Country,color = Region)) + geom_point(size=0) +
labs(title="PC1 and PC2 scores", x="PC1", y="PC2") +
guides(color=guide_legend(title = "Region")) +
theme_bw() +
theme(legend.position="bottom") +
geom_text(size=3, hjust=0.6, vjust=0, check_overlap = TRUE)
Furthermore, it might be interesting to visualize the scores for PC1 and PC2 for each country while distinguishing between income groups instead of geographical regions. This can provide insights into how economic factors contribute to the positioning of countries in the plot.
In this way, we can observe how four differentiated groups of countries appear, coinciding with the four income level classifications. This result suggests that income level is a stronger determinant than geographical location of the degree of achievement of the Sustainable Development Goals.
data.frame(z1=pca$x[,1],z2=pca$x[,2]) %>%
ggplot(aes(z1,z2,label = Country,color = Income_group)) + geom_point(size=0) +
labs(title="PC1 and PC2 scores", x="PC1", y="PC2") +
guides(color = guide_legend(title="Income Group")) +
theme_bw() +
theme(legend.position="bottom") +
geom_text(size=3, hjust=0.6, vjust=0, check_overlap = TRUE)
Below we observe that the countries that have achieved greater sustainable development, as measured by the mean PC1 score, are those belonging to the high-income group.
data.frame(z1=pca$x[,1],Income_group) %>%
group_by(Income_group) %>%
summarise(mean = mean(z1), n=n()) %>%
arrange(desc(mean))
## # A tibble: 4 × 3
## Income_group mean n
## <fct> <dbl> <int>
## 1 High income 3.54 49
## 2 Upper middle income 0.421 41
## 3 Lower middle income -2.14 42
## 4 Low income -5.04 20
Lastly, we can visualize the PC1 scores on a world map, where darker areas represent higher PC1 scores, while lighter shades represent lower PC1 scores. This allows us to identify clusters of countries with similar scores, which may indicate similarities in terms of the achievement of the Sustainable Development Goals.
map = data.frame(country = Country, value=pca$x[,1])
map$country = countrycode(map$country, 'country.name', 'iso3c')
matched <- joinCountryData2Map(map, joinCode = "ISO3",
nameJoinColumn = "country")
## 152 codes from your data successfully matched countries in the map
## 0 codes from your data failed to match with a country code in the map
## 91 codes from the map weren't represented in your data
mapCountryData(matched,nameColumnToPlot="value",missingCountryCol = "white",
addLegend = TRUE, borderCol = "#C7D9FF",
catMethod = "pretty", colourPalette = "heat", #white2Black black2White palette heat topo terrain rainbow negpos8 negpos9
mapTitle = c("PCA1 by Country"), lwd=1)
According to the results, we observe distinct patterns: countries shaded in red and dark orange exhibit better performance regarding SDG achievement. These countries are predominantly situated in Europe, North America and Australia, corresponding with higher income levels. Conversely, countries shaded in yellow and light orange struggle to achieve the SDGs and are primarily situated in Sub-Saharan Africa and South Asia. Furthermore, countries in Latin America and East Asia demonstrate moderate progress in the SDGs, represented by the orange colour, which falls in the middle of the scale. Moreover, the shades form largely contiguous regions with little intermixing, indicating that SDG performance closely follows the geographic distribution.
After conducting a Principal Component Analysis (PCA) to reduce the dimensionality of our dataset and uncover underlying patterns, our next step involves applying clustering techniques. Clustering is a fundamental tool in unsupervised learning that groups similar observations together based on certain characteristics, allowing us to identify subgroups within our data. In this section, we will use clustering algorithms to explore the structure of our data and gain insights into distinct clusters of countries in relation to the attainment of the Sustainable Development Goals.
In clustering, selecting the appropriate number of clusters is one of the most important steps. Therefore, we first try various initial guesses to determine the most suitable number of clusters for our dataset.
After performing K-means clustering with 4 clusters, we obtain cluster sizes of 42, 38, 42 and 30, indicating the number of countries grouped into each cluster. To analyse the clusters, we can observe their centroids, which represent the mean values of the variables within each cluster. These results allow us to observe how each cluster exhibits different average scores on the Sustainable Development Goals (SDGs). The cluster numbers below follow the output shown; K-means assigns them arbitrarily, so they may change between runs:
Cluster 1: This cluster consists of countries with moderate to high scores across most SDGs, with notable strengths in Goal 1 (No Poverty), Goal 12 (Responsible Consumption and Production) and Goal 13 (Climate Action). However, they show a relatively low score in Goal 9 (Industry, Innovation, and Infrastructure), indicating room for improvement in industrial development and innovation.
Cluster 2: Countries in this cluster show the lowest scores across most SDGs, particularly in Goal 1 (No Poverty), Goal 3 (Good Health and Well-being), Goal 4 (Quality Education) and Goal 9 (Industry, Innovation, and Infrastructure). In contrast, they obtain the highest scores in Goal 12 (Responsible Consumption and Production) and Goal 13 (Climate Action).
Cluster 3: This cluster represents countries with the highest scores across most SDGs, excelling in Goal 1 (No Poverty), Goal 3 (Good Health and Well-being), Goal 4 (Quality Education) and Goal 9 (Industry, Innovation, and Infrastructure), together with the highest government effectiveness and GDP per capita. However, they exhibit the lowest scores in Goal 12 (Responsible Consumption and Production) and Goal 13 (Climate Action), indicating a need for more sustainable consumption patterns and efforts to combat climate change.
Cluster 4: Countries in this cluster closely resemble those in Cluster 1, with moderate to high scores across most SDGs. The main difference is a much lower score in Goal 10 (Reduced Inequalities), about 26 compared with about 76 in Cluster 1, indicating that reducing inequalities is their main challenge.
clustering_4 = kmeans(data, centers = 4, nstart=1000)
clustering_4
## K-means clustering with 4 clusters of sizes 42, 38, 42, 30
##
## Cluster means:
## GovernmentEffectiveness log_GDP Goal_1_Score Goal_2_Score Goal_3_Score
## 1 -0.2106788 8.665975 90.52854 60.07368 73.59451
## 2 -0.9749605 7.073414 26.59214 50.93251 41.30747
## 3 1.0824422 10.606880 99.40765 66.75948 90.85321
## 4 -0.2181050 8.832832 82.49123 58.90180 71.17142
## Goal_4_Score Goal_5_Score Goal_6_Score Goal_7_Score Goal_8_Score Goal_9_Score
## 1 83.10379 57.82757 68.70789 72.78115 66.55840 40.74573
## 2 41.55357 47.69078 50.48503 42.83730 57.81520 16.23028
## 3 96.23095 76.42938 81.08382 75.78783 79.21510 81.33778
## 4 83.33559 67.62934 70.33072 72.88301 66.12517 43.15647
## Goal_10_Score Goal_11_Score Goal_12_Score Goal_13_Score Goal_14_Score
## 1 75.83504 71.32693 87.15251 85.89809 61.32298
## 2 47.41955 47.50482 95.70163 97.90572 66.63736
## 3 85.40417 86.06820 65.86111 52.85738 62.91336
## 4 26.41953 77.07228 88.40605 85.50198 67.69310
## Goal_15_Score Goal_16_Score Goal_17_Score
## 1 61.59968 69.25757 61.91748
## 2 64.20708 51.18080 50.87273
## 3 73.22698 81.75931 63.08022
## 4 62.50322 63.42087 62.07891
##
## Clustering vector:
## [1] 2 1 1 2 4 1 3 3 1 1 1 1 3 3 4 2 1 4 1 4 4 3 4 2 2 1 2 3 2 2 4 4 4 4 2 3 1
## [38] 3 3 2 4 4 1 3 2 2 1 3 3 1 1 3 4 3 4 2 1 2 4 3 3 4 1 1 3 3 3 4 3 1 1 2 3 1
## [75] 1 3 1 2 2 3 3 2 2 4 1 2 3 2 1 4 1 1 1 1 2 1 4 1 3 3 4 2 2 1 3 4 2 4 2 4 4
## [112] 4 3 3 3 1 3 2 2 1 2 1 2 3 3 3 2 4 2 3 1 2 4 3 3 2 1 2 1 2 3 1 4 2 1 3 3 3
## [149] 1 1 2 4
##
## Within cluster sum of squares by cluster:
## [1] 87773.59 110865.46 86476.61 68532.30
## (between_SS / total_SS = 63.3 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
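Two properties of K-means are worth keeping in mind when reading this output. First, the algorithm starts from random centers, so cluster numbers can change between runs unless a seed is fixed. Second, distances are computed in the units of the variables, so variables on larger scales weigh more unless the data are standardised. The cluster means above are reported in the original units, which suggests the algorithm was run on unscaled data. The sketch below illustrates both points on the built-in `USArrests` data, a stand-in for the report's dataset; the seed value and object names are arbitrary assumptions.

```r
# Stand-in data: USArrests ships with base R (it is not the report's dataset)
x_raw <- USArrests
x_std <- scale(USArrests)  # mean 0 and standard deviation 1 per column

# Fixing the seed makes the partition and its numbering reproducible
set.seed(123)
fit_a <- kmeans(x_std, centers = 3, nstart = 25)
set.seed(123)
fit_b <- kmeans(x_std, centers = 3, nstart = 25)
same_labels <- identical(fit_a$cluster, fit_b$cluster)

# Share of the total variation that lies between clusters,
# i.e. the "between_SS / total_SS" figure printed by kmeans
ratio_std <- fit_a$betweenss / fit_a$totss

# On raw data the column with the largest variance dominates the distances
set.seed(123)
fit_raw <- kmeans(x_raw, centers = 3, nstart = 25)

cat("Same labels with the same seed:", same_labels, "\n")
cat("between_SS / total_SS (standardised):", round(100 * ratio_std, 1), "%\n")
print(table(standardised = fit_a$cluster, raw = fit_raw$cluster))
```

The cross-table in the last line shows how far the two partitions agree; when the variables are measured on comparable scales, as with SDG scores between 0 and 100, the difference may be small.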
Furthermore, the centroids can be represented in the following bar plots to visualize how the variable averages differ across clusters. These results suggest that there exists a distinct cluster formed by countries with a very good performance on the SDGs and, conversely, a cluster whose countries show a relatively low level of SDG progress. The two remaining clusters show very similar characteristics, as they exhibit good overall performance.
centers=clustering_4$centers
barplot(centers[1,], las=2, col="darkblue")
barplot(centers[2,], las=2, col="darkblue")
barplot(centers[3,], las=2, col="darkblue")
barplot(centers[4,], las=2, col="darkblue")
Finally, we can illustrate the four distinct groups of countries in a cluster map, where each cluster is drawn together with the countries it contains. As we can observe, the clusters overlap in the plot. However, we can distinguish between the cluster with the lowest SDG achievement, located on the left, and the cluster with the best performance, located on the right. The two groups in the middle, where overlapping is higher, correspond to the two similar clusters with medium-high progress on the SDGs.
fviz_cluster(clustering_4, data = data, geom = c("point"),ellipse.type = 'norm', pointsize=1) +
theme_minimal() +
geom_text(label=Country,hjust=0, vjust=0,size=2,check_overlap = F) +
scale_fill_brewer(palette="Paired")
Next, we attempt to improve the interpretation by performing K-means clustering with 3 clusters to reduce overlapping and facilitate understanding of the results. However, it’s important to note that using 3 clusters instead of 4 doesn’t necessarily mean it’s a better choice.
According to the results shown below, we obtain 3 clusters of sizes 70, 43 and 39, which exhibit the following characteristics:
Cluster 1: This cluster represents countries with moderate to high scores across most Sustainable Development Goals (SDGs). They show strengths in Goal 1 (No Poverty), Goal 3 (Good Health and Well-being), and Goal 7 (Affordable and Clean Energy), with relatively lower scores in Goal 9 (Industry, Innovation, and Infrastructure) and Goal 10 (Reduced Inequalities).
Cluster 2: Countries in this cluster demonstrate high scores across most SDGs, particularly excelling in Goal 1, Goal 4 (Quality Education), and Goal 9. However, they exhibit lower scores in Goal 12 (Responsible Consumption and Production) and Goal 13 (Climate Action), suggesting a need for more sustainable consumption patterns and stronger climate action.
Cluster 3: This cluster comprises countries with relatively lower scores across most SDGs compared to the other clusters. They show particularly low scores in Goal 1, Goal 9 and Goal 10 (Reduced Inequalities), indicating challenges in poverty reduction, industrial development and reducing inequalities.
clustering_3 = kmeans(data, centers=3, nstart=1000)
clustering_3
## K-means clustering with 3 clusters of sizes 70, 43, 39
##
## Cluster means:
## GovernmentEffectiveness log_GDP Goal_1_Score Goal_2_Score Goal_3_Score
## 1 -0.2224941 8.729759 87.56921 59.75460 72.81891
## 2 1.0606048 10.577636 99.42052 66.58298 90.60675
## 3 -0.9489487 7.110593 27.11874 50.86133 41.49274
## Goal_4_Score Goal_5_Score Goal_6_Score Goal_7_Score Goal_8_Score Goal_9_Score
## 1 83.17859 61.42041 69.47511 73.36817 66.40066 41.83930
## 2 95.99746 76.31619 80.94542 75.49040 79.02669 80.52588
## 3 42.58386 48.68966 50.65881 42.88066 57.87245 16.60484
## Goal_10_Score Goal_11_Score Goal_12_Score Goal_13_Score Goal_14_Score
## 1 55.35973 73.75308 87.70842 86.16175 63.96060
## 2 85.67345 85.89875 66.31418 52.85245 62.57510
## 3 46.34451 47.98935 95.49528 97.67252 66.99917
## Goal_15_Score Goal_16_Score Goal_17_Score
## 1 61.68326 66.78235 61.87549
## 2 72.85695 81.45622 62.88104
## 3 64.79509 51.61087 51.54528
##
## Clustering vector:
## [1] 3 1 1 3 1 1 2 2 1 1 1 1 2 2 1 3 1 1 1 1 1 2 1 3 3 1 3 2 3 3 1 1 1 1 3 2 1
## [38] 2 2 3 1 1 1 2 3 3 1 2 2 1 1 2 1 2 1 3 1 3 1 2 2 1 1 1 2 2 2 1 2 1 2 3 2 1
## [75] 1 2 1 3 3 2 2 3 3 1 1 3 2 3 1 1 1 1 1 1 3 1 3 1 2 2 1 3 3 1 2 1 3 1 3 1 1
## [112] 1 2 2 2 1 2 3 3 1 3 1 3 2 2 2 3 1 3 2 1 3 1 2 2 3 1 3 1 3 2 1 1 3 1 2 2 2
## [149] 1 1 3 1
##
## Within cluster sum of squares by cluster:
## [1] 192574.87 89460.91 118505.27
## (between_SS / total_SS = 58.5 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
Again, we can visualize the centroids of each cluster in bar plots, which facilitates the interpretation of the differences obtained from the results above.
centers=clustering_3$centers
barplot(centers[1,], las=2, col="darkblue")
barplot(centers[2,], las=2, col="darkblue")
barplot(centers[3,], las=2, col="darkblue")
Finally, we can represent the three clusters on a cluster map, where each group is illustrated in a different color. We observe that overlapping in the plot is lower than with 4 clusters, which facilitates the interpretation of the results. Accordingly, we can differentiate three clear groups of countries: a cluster with poor SDG performance, located on the left; a cluster with intermediate performance, plotted in the middle of the map; and lastly, a cluster with high SDG performance, situated on the right side.
# clusplot
fviz_cluster(clustering_3, data = data, geom = c("point"),ellipse.type = 'norm', pointsize=1)+
theme_minimal() +
geom_text(label = Country,hjust=0, vjust=0,size=2,check_overlap = F) +
scale_fill_brewer(palette="Paired")
As previously mentioned, the number of groups is a key decision in cluster analysis. Various methods can provide guidance in determining the optimal number of clusters. Three common techniques are the Within-Cluster Sum of Squares (WCSS), the Average Silhouette and the Gap statistic. By considering the results from these methods collectively, we can gain insights into the most suitable number of clusters for our dataset.
According to the within-cluster sum of squares (elbow) method, the optimal number of clusters is located at the point where the total within-cluster sum of squares starts to decrease more slowly when another cluster is added. Looking at the plot, we can notice that from k = 3 onwards the WCSS falls only gradually. Therefore, under this method, the suggested number of groups is 3.
fviz_nbclust(scale(data), kmeans, method = 'wss', k.max = 20, nstart = 1000) # the smooth decrease starts at k = 3, hinting that 3 groups might be the best choice
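The curve behind the elbow plot can be sketched in base R: K-means is run for each candidate k and the total within-cluster sum of squares is recorded. The example uses the built-in `USArrests` data as a stand-in, and the range of k, the seed and the object names are arbitrary assumptions.

```r
# Stand-in data: USArrests ships with base R (it is not the report's dataset)
x <- scale(USArrests)

set.seed(123)
k_values <- 1:8

# Total within-cluster sum of squares for each number of clusters
wss <- sapply(k_values, function(k) {
  kmeans(x, centers = k, nstart = 25)$tot.withinss
})

# Reduction obtained by adding one more cluster; the elbow is where it flattens
drop <- c(NA, -diff(wss))

print(data.frame(k = k_values, wss = round(wss, 1), drop = round(drop, 1)))
```

For k = 1 the value equals the total sum of squares of the data, which gives a reference point for judging how much each additional cluster reduces it.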
Under the Average Silhouette method, the optimal number of clusters is the one with the highest average silhouette width, since a high average width indicates a good clustering. The plot below hints that 2 clusters might be the most suitable number of groups for our data.
fviz_nbclust(scale(data), kmeans, method = 'silhouette', k.max = 20, nstart = 1000) # the higher the better: here the optimum is 2 groups
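The average silhouette width can also be computed directly with `silhouette()` from the `cluster` package, which takes the cluster labels and a distance matrix. The sketch below does so for several values of k on the built-in `USArrests` data, a stand-in for the report's dataset; the range of k, the seed and the object names are arbitrary assumptions.

```r
library(cluster)

# Stand-in data: USArrests ships with base R (it is not the report's dataset)
x <- scale(USArrests)
d <- dist(x)  # Euclidean distances between observations

set.seed(123)
k_values <- 2:8  # the silhouette is not defined for a single cluster

# Average silhouette width for each number of clusters
avg_sil <- sapply(k_values, function(k) {
  labels <- kmeans(x, centers = k, nstart = 25)$cluster
  mean(silhouette(labels, d)[, "sil_width"])
})

best_k <- k_values[which.max(avg_sil)]

print(data.frame(k = k_values, avg_silhouette = round(avg_sil, 3)))
cat("Number of clusters with the highest average width:", best_k, "\n")
```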
Under the gap statistic method, the optimal number of clusters is the value of k at which the gap statistic first reaches a peak. Accordingly, the suggested number of clusters is 8, as the gap statistic peaks at k = 8.
fviz_nbclust(scale(data), kmeans, method = 'gap_stat', k.max = 10, nstart = 100, nboot = 500)
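The gap statistic can be obtained with `clusGap()` from the `cluster` package, and the choice of k then depends on the selection rule passed to `maxSE()`: a rule based on the first peak and the global maximum need not agree. The sketch below compares the two on the built-in `USArrests` data, a stand-in for the report's dataset; `K.max`, the number of bootstrap samples `B`, the seed and the object names are arbitrary assumptions. Which rule `fviz_nbclust` applies by default is not verified here.

```r
library(cluster)

# Stand-in data: USArrests ships with base R (it is not the report's dataset)
x <- scale(USArrests)

set.seed(123)
gap <- clusGap(x, FUNcluster = kmeans, K.max = 6, B = 50,
               nstart = 25, verbose = FALSE)
tab <- gap$Tab  # columns: logW, E.logW, gap, SE.sim

# First k within one standard error of the first local maximum
k_first <- maxSE(tab[, "gap"], tab[, "SE.sim"], method = "firstSEmax")

# k with the largest gap statistic overall
k_global <- maxSE(tab[, "gap"], tab[, "SE.sim"], method = "globalmax")

print(round(tab, 3))
cat("First-peak rule (with SE):", k_first, " global maximum:", k_global, "\n")
```

Reporting both values makes it clear whether a suggested k comes from an early peak or from the overall maximum of the curve.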
Taking the hints provided by the three methods together, it seems suitable to group our observations into between 2 and 8 clusters. Furthermore, since we are interested in exploring whether the groups of countries with similar attainment of the Sustainable Development Goals also converge in terms of their income level classification, we finally determine that the most appropriate number of groups for our analysis might be 4. This number falls within the range suggested by the three methods and also allows us to check whether the four income levels (low, lower-middle, upper-middle and high income) are reflected in the four groups.
fit.kmeans_4 = kmeans(data, centers=4, nstart=1000)
map_4 = data.frame(country = Country, value=fit.kmeans_4$cluster)
map_4$country = countrycode(map_4$country, 'country.name', 'iso3c')
matched <- joinCountryData2Map(map_4, joinCode = "ISO3",
nameJoinColumn = "country")
## 152 codes from your data successfully matched countries in the map
## 0 codes from your data failed to match with a country code in the map
## 91 codes from the map weren't represented in your data
mapCountryData(matched,nameColumnToPlot="value",missingCountryCol = "white",
borderCol = "#C7D9FF",
catMethod = "pretty", colourPalette = "heat",
mapTitle = c("Clusters"), lwd=1)
In summary, since we are interested not only in clustering countries by their performance on the Sustainable Development Goals but also in analysing whether they share similar income levels, we set the number of groups to 4. As the map above shows, this choice yields 4 differentiated groups of countries in terms of SDG performance, which also allows us to differentiate the four income levels within these groups.
The results reveal clear trends: countries shaded in red exhibit stronger performance in SDG achievement and are largely clustered in Europe, North America, Australia, and New Zealand, reflecting their higher income levels. We can also notice a cluster formed by regions in northern Asia and North Africa, as well as some Pacific islands, depicting similar levels of sustainable development. Conversely, countries shaded in yellow and light orange face challenges in meeting the SDGs and are mainly located in Sub-Saharan Africa and Latin America. These patterns align with countries having lower income levels compared to those in more developed regions.
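Because k-means cluster numbers are arbitrary labels, which cluster counts as the "high" or "low" performer is best confirmed from the cluster centres rather than from the map colours alone. The sketch below illustrates this on simulated scores. The data frame `toy` and its columns `goal_a`, `goal_b` and `goal_c` are hypothetical and do not come from the report's dataset.

```r
# Hypothetical scores on a 0-100 scale for 40 "countries":
# 20 with high scores and 20 with low scores on three goals.
set.seed(42)
toy <- data.frame(
  goal_a = c(rnorm(20, 80, 5), rnorm(20, 45, 5)),
  goal_b = c(rnorm(20, 75, 5), rnorm(20, 50, 5)),
  goal_c = c(rnorm(20, 85, 5), rnorm(20, 40, 5))
)

fit <- kmeans(toy, centers = 2, nstart = 25)

# Mean score of each cluster on each goal
centres <- round(fit$centers, 1)

# Order the cluster ids by their overall mean score, best first,
# so that the labels can be interpreted as performance levels.
overall <- rowMeans(fit$centers)
ranking <- order(overall, decreasing = TRUE)
centres[ranking, ]

# Number of observations per cluster
table(fit$cluster)
```

The same ordering could then be used to recode the cluster ids before mapping, so that the colour scale follows performance.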
In this section we focus on hierarchical clustering, another technique for grouping similar observations into clusters. Its main characteristic, and one of its key advantages, is that it organizes the data in a hierarchical structure, which reveals potential hierarchical relationships within the data and allows for a flexible and intuitive exploration of cluster structures.
Unlike k-means, hierarchical clustering does not require the number of clusters to be specified in advance. What matters in this method is the choice of the distance between observations and of the linkage used to join groups.
Distance metrics measure the dissimilarity or similarity between data points, while linkage criteria determine how clusters are merged or split at each step of the algorithm.
d = dist(scale(data), method = "euclidean")
hc <- hclust(d, method = "ward.D2")
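One way to see how the linkage choice affects the result is to compare, for several linkages, how faithfully the tree preserves the original distances (the cophenetic correlation). The sketch below does this with base R on simulated data. It is an illustration under assumed inputs, not a justification of the `ward.D2` choice made above.

```r
# Toy stand-in for scale(data): 40 observations, 5 features
# (hypothetical values).
set.seed(42)
toy_scaled <- scale(matrix(rnorm(200), ncol = 5))
d <- dist(toy_scaled, method = "euclidean")

# Cophenetic correlation: correlation between the original distances
# and the heights at which pairs of observations are merged in the tree.
linkages <- c("ward.D2", "complete", "average", "single")
coph_cor <- sapply(linkages, function(m) {
  hc_m <- hclust(d, method = m)
  cor(d, cophenetic(hc_m))
})
round(coph_cor, 3)
```

A higher value means the dendrogram distorts the pairwise distances less, although it does not by itself indicate more interpretable clusters.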
There are several methods to visualize the hierarchical clustering, facilitating the interpretation of the results. In this section, we will use classical dendrograms, phylogenic trees, geographical maps, and heatmaps.
As mentioned before, hierarchical clustering organizes the countries in a hierarchical structure, which can be visualized in a dendrogram. This dendrogram visually represents the relationships between observations, showing how they cluster together based on their similarities.
hc$labels <- Country
dend_plot <- fviz_dend(x = hc,
k = 4,
palette = "jco",
rect = TRUE, rect_fill = TRUE, cex = 0.5,
rect_border = "jco")
dend_plot
In the dendrogram above, distinguishing the countries is quite difficult. However, two major hierarchical clusters can be noticed. On the one hand, there is a group corresponding to low SDG performance on the left, from which two new clusters emerge, each leading in turn to lower-level clusters. On the other hand, there is a cluster on the right, including countries with better performance. From this cluster, two new clusters emerge, dividing observations into a group with very good attainment and another with modest performance. Within the latter, more clusters of lower hierarchical levels form, further dividing the countries with lower performance.
To improve the visualization of the branches at lower hierarchical levels, we can plot the subtrees separately, which allows us to see the countries within each cluster at lower levels.
dend_data <- attr(dend_plot, "dendrogram")
dend_cuts <- cut(dend_data, h = 40)
# Left subtree
fviz_dend(dend_cuts$lower[[1]], main = "Subtree 1")
# Right subtree
fviz_dend(dend_cuts$lower[[2]], main = "Subtree 2")
To facilitate the visualization, we represent the data using a phylogenic tree, where results can be more easily interpreted than in the previous dendrogram.
clusters <- cutree(hc, k = 4)
fviz_dend(x = hc,
k = 4,
color_labels_by_k = TRUE,
cex = 0.8,
k_color_palette = "jco",
type = "phylogenic",
repel = TRUE) +
labs(title="Socio-economic-health tree clustering of the world") +
theme(axis.text.x=element_blank(),axis.text.y=element_blank())
Given that the observations in our data are countries, it is also interesting to represent the hierarchical clustering results on a map, where each country is colored according to the cluster it belongs to.
groups.hc = cutree(hc, k = 4)
map = data.frame(country = Country, value = groups.hc)
map$country = countrycode(map$country, 'country.name', 'iso3c')
matched <- joinCountryData2Map(map, joinCode = "ISO3",
nameJoinColumn = "country")
## 152 codes from your data successfully matched countries in the map
## 0 codes from your data failed to match with a country code in the map
## 91 codes from the map weren't represented in your data
mapCountryData(matched,nameColumnToPlot="value",missingCountryCol = "white",
borderCol = "#C7D9FF",
catMethod = "pretty", colourPalette = "terrain",
mapTitle = c("Clusters"), lwd=1)
According to the patterns displayed on the map, we observe some slight differences between the results obtained from hierarchical clustering and those from the k-means approach. These differences can be attributed to the distinct algorithms and methodologies used by each method: hierarchical clustering groups data into a tree-like structure based on similarity, which may result in different clusters compared to k-means. Note also that the k-means fit above was run on the unscaled data, whereas the hierarchical clustering uses scale(data), which may contribute to the differences.
In this context, we observe that countries shaded in light colors primarily correspond to regions in Latin America, Africa, and parts of Asia, consistent with the findings obtained from the k-means clustering. However, a new pattern emerges in South Asia, where countries appear as a distinct cluster, exhibiting relatively higher levels of sustainable development, followed by the usual developed regions, such as Europe, North America and Australia. These distinctions underscore the importance of considering various clustering methods to gain a comprehensive understanding of the underlying patterns within the data.
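The agreement between the two partitions can also be quantified with a cross-tabulation instead of comparing the two maps by eye. The sketch below shows the idea on simulated data. The objects `km_groups` and `hc_groups` are hypothetical counterparts of `fit.kmeans_4$cluster` and `groups.hc`, and both methods are applied to the same scaled toy matrix.

```r
# Toy stand-in for the report's data (hypothetical values):
# three well-separated groups of 30 observations with 4 features.
set.seed(42)
toy <- rbind(
  matrix(rnorm(120, mean = 0), ncol = 4),
  matrix(rnorm(120, mean = 5), ncol = 4),
  matrix(rnorm(120, mean = 10), ncol = 4)
)
toy_scaled <- scale(toy)

km_groups <- kmeans(toy_scaled, centers = 3, nstart = 25)$cluster
hc_groups <- cutree(hclust(dist(toy_scaled), method = "ward.D2"), k = 3)

# Rows: k-means clusters; columns: hierarchical clusters.
# Cluster ids are arbitrary, so agreement shows up as one dominant
# cell per row rather than as a filled diagonal.
agreement <- table(kmeans = km_groups, hierarchical = hc_groups)
agreement

# Share of observations falling in the dominant cell of their row
matched_share <- sum(apply(agreement, 1, max)) / sum(agreement)
matched_share
```

Rows with several sizeable cells point to the groups of countries that the two methods assign differently.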
Finally, we can use a heatmap, also known as a false color map, which visualizes the hierarchical clustering on a color scale, with one dendrogram on the left side and another at the top of the plot.
heatmap(scale(data),
scale = "none",
labRow = Country,
col = bluered(100),
distfun = function(x){dist(x, method = "euclidean")},
hclustfun = function(x){hclust(x, method = "ward.D2")},
main = "Heatmap of Data",
xlab = "Sustainability Goals", ylab = "Countries",
cexRow = 0.7,
margins = c(7,7))
To interpret the hierarchical clustering patterns illustrated in the heatmap, we need to observe the similarities and dissimilarities among both the countries (displayed in rows) and the sustainability goals (represented in columns). In this context, red indicates higher values, while blue represents lower values, helping us to identify which countries and goals have higher or lower scores relative to the rest. Moreover, the left dendrogram indicates how the countries are clustered based on their similarities in achieving the sustainability goals, while the top dendrogram shows how the sustainability goals are clustered based on the similarities in their patterns across countries. Therefore, countries that are closer together on the dendrogram share more similar profiles in terms of their performance across the goals. This is the case of the Netherlands, Australia, Hungary and Portugal, which are located together and appear in red for most of the goals, indicating good performance in the assessment of the SDGs. Conversely, we observe similarities among Mali, Zambia, Rwanda, and Tanzania, which are clustered together and shaded in blue for most of the SDGs, indicating weak progress towards the goals. A cluster of countries with moderate progress can also be noticed, shaded with both blue and red, such as Bolivia, Lebanon and Mauritius.
The analysis conducted in this report provides a comprehensive understanding of Sustainable Development Goals (SDGs) performance across countries. It relies on Principal Component Analysis (PCA) and clustering techniques, and links sustainable performance with the economic and governance quality of the countries. The results indicate that income level plays a more significant role than geographical location in determining the degree of achievement of the Sustainable Development Goals.
Through PCA, we identified the key variables contributing to the variability in SDG performance. The analysis revealed that variables such as GDP, Government Effectiveness, and specific SDGs like Goal 3 (Good Health and Well-being) and Goal 16 (Peace, Justice, and Strong Institutions) have a significant impact on the overall variability in SDG performance. Conversely, the goals on the sustainable use of the oceans and of terrestrial ecosystems (SDG 14 and SDG 15) are the variables that contribute least. By visualizing the scores on PC1 and PC2, we observed distinct clusters of countries with similar patterns of SDG performance, allowing for deeper insights into regional and income-level disparities. These results support the idea that countries which converge in terms of SDG performance also have similar characteristics regarding economic and political structure.
Subsequently, k-means clustering was employed to group countries based on their SDG performance. After exploring different numbers of clusters, we found that clustering into 4 groups provided clear distinctions between countries with high, moderate, and low SDG performance. Hierarchical clustering further supported some of these findings, revealing hierarchical relationships between clusters and highlighting the influence of income level on SDG attainment. However, new patterns emerged for some specific clusters. These differences between the results of the k-means and hierarchical approaches show the importance of considering various clustering methods to gain a deeper understanding of the underlying patterns within our data.
Overall, the analysis provided valuable insights into the complex topic of achieving global sustainable development, enabling policymakers and politicians to identify priority areas for intervention to address the challenges in SDG performance. At the same time, the analysis pointed out the importance of considering the economic and political structure of each country when addressing sustainable development, as areas with higher income levels and more efficient governance tend to succeed in achieving the SDGs, while countries with weaker economies and governance quality tend to struggle with sustainable development. Accordingly, by using data-driven approaches like PCA and clustering, we can work towards achieving the goal of sustainable development for all nations.